Robert Turner, University of Sheffield RSE Team, September 2021
Heavily based on "Reproducible Research Data and Project Management in R" by Anna Krystalli, "naming things" by Jenny Bryan, and "Methods in Research Software Engineering" by David Wilby.
Mix of software engineering and research experience.
13 RSEs, 35 projects / year worth ~£11m total
Practical advice on:
What operating system(s) do you use?
What programming language(s) do you use?
Some years ago, Tom Webb (@tomjwebb) asked for advice on Twitter. Some of the resulting conversation is included in this presentation…
Act as though every short term study will become a long term one @tomjwebb. Needs to be reproducible in 3, 20, 100 yrs
— Oceans Initiative (@oceansresearch) January 16, 2015
Take initiative & responsibility. Think long term.
@tomjwebb stay away from excel at all costs?
— Timothée Poisot (@tpoi) January 16, 2015
Do you agree?
THRILLED by this announcement by the Human Gene Nomenclature Committee. pic.twitter.com/BqLIOMm69d
— Janna Hutz (@jannahutz) August 4, 2020
But good for data viewing / entry, sometimes, perhaps…
@tomjwebb Entering via a database management system (e.g., Access, Filemaker) can make entry easier & help prevent data entry errors @tpoi
— Ethan White (@ethanwhite) January 16, 2015
@ethanwhite +1 Enforcing data types, options from selection etc, just some useful things a DB gives you, if you turn them on @tomjwebb @tpoi
— Gavin Simpson (@ucfagls) January 16, 2015
@tomjwebb it also prevents a lot of different bad practices. It is possible to do some of this in Excel. @tpoi
— Ethan White (@ethanwhite) January 16, 2015
Have a look at the Data Carpentry SQL for Ecology lesson
.csv: comma separated values.
.tsv: tab separated values.
.txt: no formatting specified.
@tomjwebb It has to be interoperability/openness - can I read your data with whatever I use, without having to convert it?
— Paul Swaddle (@paul_swaddle) January 16, 2015
more unusual formats will need instructions on use.
Andrea De Santis, unsplash.com
.csv or .tsv copy would need to be saved.
Use good null values; missing values are a fact of life:
NA or NULL are also good options
Do not use 0. Avoid numbers like -999
@tomjwebb don't, not even with a barge pole, not for one second, touch or otherwise edit the raw data files. Do any manipulations in script
— Gavin Simpson (@ucfagls) January 16, 2015
@tomjwebb @srsupp Keep one or a few good master data files (per data collection of interest), and code your formatting with good annotation.
— Desiree Narango (@DLNarango) January 16, 2015
Raw data are sacrosanct
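The two pieces of advice above (explicit null values, and manipulating data only in a script) can be sketched together. This is a minimal illustration in Python rather than the author's own workflow; the data, the column names and the helper functions are all hypothetical. The raw text is only read, never rewritten, and a recorded 0 stays a real measurement instead of standing in for "missing".

```python
import csv
import io

# Hypothetical raw data: blank, "NA" and "NULL" all mark a missing value.
# In a real project this text would come from a raw file opened read-only.
RAW_CSV = "site,temperature\nA,12.5\nB,\nC,NA\nD,NULL\nE,0\n"

MISSING_MARKERS = {"", "NA", "NULL"}


def parse_temperature(text):
    """Return a float, or None when the cell holds a missing-value marker."""
    if text.strip() in MISSING_MARKERS:
        return None
    return float(text)


def read_temperatures(raw_text):
    """Read the raw text without modifying it; return (site, value) pairs."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [(row["site"], parse_temperature(row["temperature"])) for row in reader]


records = read_temperatures(RAW_CSV)

# Missing values are skipped; the genuine 0 at site E is kept.
observed = [value for _, value in records if value is not None]
mean_temperature = sum(observed) / len(observed)

print(records)
print(mean_temperature)
```

Had the missing cells been coded as 0 or -999, the mean would have been silently wrong.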
Photo by Jon Moore, unsplash.com
Photo: Pexels CC0
main copy of files
@tomjwebb Back it up
— Ben Bond-Lamberty (@BenBondLamberty) January 16, 2015
NO
myabstract.docx
Joe’s Filenames Use Spaces and Punctuation.xlsx
figure 1.png
fig 2.png
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt
YES
2014-06-08_abstract-for-sla.docx
joes-filenames-are-getting-better.xlsx
fig01_scatterplot-talk-length-vs-interest.png
fig02_histogram-talk-attendance.png
1986-01-28_raw-data-from-challenger-o-rings.txt
What makes a good file name?
In the following:
ls -lh *Plasmid*
*Plasmid* is a glob: a wildcard pattern that matches every file name containing "Plasmid".
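The same kind of pattern matching is available outside the shell. The sketch below uses Python's standard fnmatch module on a made-up list of file names (all of them hypothetical), so it runs without touching the file system.

```python
import fnmatch

# Hypothetical directory listing, used only for illustration.
file_names = [
    "2021-09-01_assay_Plasmid-A_plate01.csv",
    "2021-09-01_assay_Control_plate01.csv",
    "notes.txt",
    "Plasmid-summary.txt",
]

# "*" matches any run of characters, so this keeps every name
# that contains "Plasmid" (case-sensitive).
matches = [name for name in file_names if fnmatch.fnmatchcase(name, "*Plasmid*")]

print(matches)
```

Consistent file names are what make a short pattern like this reliable.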
Deliberate use of "-" and "_" allows recovery of metadata from the filenames:
"_" underscore used to delimit units of metadata I want to access later
"-" hyphen used to delimit words so our eyes don’t bleed
This happens to be R but also possible in the shell, Python, etc.
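As a rough sketch of that recovery in Python: split on "_" to get the units of metadata, then on "-" to get the words inside a unit. The function name and the returned keys are illustrative choices, not part of any library.

```python
from pathlib import Path


def split_file_name(file_name):
    """Split a file name into its "_"-delimited units and its extension."""
    path = Path(file_name)
    return {"units": path.stem.split("_"), "extension": path.suffix}


parts = split_file_name("2014-06-08_abstract-for-sla.docx")
date_unit, description_unit = parts["units"]

# Hyphens separate words within a single unit of metadata.
words = description_unit.split("-")

print(parts)
print(date_unit, words)
```

Because the date uses hyphens internally, it survives the split on "_" as a single unit.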
e.g. I’m saving a number of files of temperature data extracted at different resolutions (res) and for a number of months (month). Including these parameters in the filename allows me to use them to target files to read in.
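# Build the file name from the parameters so the same call finds it again
file_name <- paste0(paste("variable", res, month, sep = "_"), ".csv")
write.csv(df, file_name, row.names = FALSE)
df <- read.csv(file_name)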
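The same idea in Python, as a minimal sketch: build each file name from its parameters, then use those parameters to pick out the files to read. The variable name, the resolutions and the months below are made-up values, and build_file_name is a hypothetical helper.

```python
from pathlib import Path


def build_file_name(variable, res, month):
    """Join the parameters with "_" so they can be recovered later."""
    return "_".join([variable, str(res), month]) + ".csv"


# Hypothetical parameter values, for illustration only.
names = [
    build_file_name("temperature", res, month)
    for res in ("1km", "5km")
    for month in ("jan", "feb")
]

# Target only the 5km files by reading the resolution back out of the name.
wanted = [name for name in names if Path(name).stem.split("_")[1] == "5km"]

print(names)
print(wanted)
```

This only works if none of the parameter values contains an underscore itself.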
01_marshal-data.r
02_pre-dea-filtering.r
03_dea-with-limma-voom.r
04_explore-dea-results.r
90_limma-model-term-name-fiasco.r
02_pre-dea-filtering-preDE-filtering.png
03_dea-with-limma-voom-voom-plot.png
04_explore-dea-results-focus-term-adjusted-p-values1.png
04_explore-dea-results-focus-term-adjusted-p-values2.png
...
90_limma-model-term-name-fiasco-first-voom.png
90_limma-model-term-name-fiasco-second-voom.png
Use the ISO 8601 standard for dates: YYYY-MM-DD
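One practical reason for YYYY-MM-DD is that plain alphabetical sorting then matches chronological order. A small Python sketch, using arbitrary example dates, compares it with a day-first format:

```python
from datetime import date

# Arbitrary example dates, deliberately out of order.
dates = [date(2015, 1, 16), date(1986, 1, 28), date(2014, 6, 8), date(2020, 8, 4)]
chronological = sorted(dates)

# ISO 8601 strings: sorting the text gives chronological order.
iso_sorted = sorted(d.isoformat() for d in dates)
iso_matches = iso_sorted == [d.isoformat() for d in chronological]

# Day-first strings: sorting the text orders by day of month instead.
dmy_sorted = sorted(d.strftime("%d-%m-%Y") for d in dates)
dmy_matches = dmy_sorted == [d.strftime("%d-%m-%Y") for d in chronological]

print(iso_sorted, iso_matches)
print(dmy_sorted, dmy_matches)
```

A file browser sorting by name therefore lists ISO-dated files in date order for free.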
If you don’t left pad, you get this:
10_final-figs-for-publication.R
1_data-cleaning.R
2_fit-model.R
which is just sad :(
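The sad ordering is easy to reproduce, and to avoid, when file names are generated in code. A minimal Python sketch using the three script names above, with the numeric prefix written with and without zero padding:

```python
# Step numbers and names taken from the example above.
steps = {1: "data-cleaning", 2: "fit-model", 10: "final-figs-for-publication"}

# Without padding, "10_" sorts before "1_" and "2_".
unpadded = sorted(f"{number}_{name}.R" for number, name in steps.items())

# With two-digit padding, text order matches the intended order.
padded = sorted(f"{number:02d}_{name}.R" for number, name in steps.items())

print(unpadded)
print(padded)
```

Two digits cover up to 99 scripts; a wider pad is only needed if there will be more.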
Go forth and use awesome file names :)
Where shall I put my data?
myproject/
|
├── 01_data/
| ├── 01_raw/
| ├── 02_working/
| └── 03_clean/
|
├── 02_scripts/
|
├── 03_figures/
|
├── 04_paper/
|
├── 05_presentation/
|
├── readme.md
|
└── license.md
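A layout like this can be created by a short script so that every project starts the same way. The following Python sketch builds the tree above inside a temporary directory so it is safe to run; create_project is a hypothetical helper, not part of any package.

```python
import tempfile
from pathlib import Path

# Folders and files from the layout above.
FOLDERS = [
    "01_data/01_raw",
    "01_data/02_working",
    "01_data/03_clean",
    "02_scripts",
    "03_figures",
    "04_paper",
    "05_presentation",
]
FILES = ["readme.md", "license.md"]


def create_project(root):
    """Create the folder skeleton and empty top-level files under root."""
    root = Path(root)
    for folder in FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        (root / name).touch(exist_ok=True)
    return root


# Build in a temporary directory; use a real path for an actual project.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project(Path(tmp) / "myproject")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))

print(created)
```

Running it twice is harmless because existing folders and files are left as they are.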
R (rrtools)
analysis/
|
├── paper/
│ ├── paper.Rmd # this is the main document to edit
│ └── references.bib # this contains the reference list information
│
├── figures/ # location of the figures produced by the Rmd
|
├── data/
│ ├── raw_data/ # data obtained from elsewhere
│ └── derived_data/ # data generated during the analysis
|
└── templates
    ├── journal-of-archaeological-science.csl
    |                      # this sets the style of citations & reference list
    ├── template.docx      # used to style the output of the paper.Rmd
    └── template.Rmd
Good to include:
requirements.txt, environment.yml etc.
.prj file (xml)
renv.lock - use renv package
Don’t write your own dependency management.